Word List (by frequency)
# | word (frequency) | phonetic | sentence |
1 | lexicon (39) | [ˈleksɪkən] |
- Following that, the detected characters are integrated into words by means of dynamic programming, lexicon search [35], etc. Other work adopts top-down approaches, where text is recognized directly from entire input images, rather than by detecting and recognizing individual characters.
- 3.4. Recognizing With a Lexicon
- When a test image is associated with a lexicon, i.e. a set of words for selection, the recognition process is to pick the word with the highest posterior conditional probability:
- However, on very large lexicons, e.g. Hunspell [1], which contains more than 50k words, computing Eq. 10 is time-consuming, as it requires iterating over all lexicon words.
- We adopt an efficient approximate search scheme on large lexicons.
- We first construct a prefix tree over a given lexicon.
- Since the tree depth is at most the length of the longest word in the lexicon, this search process takes much less computation than the precise search.
- The titles “50”, “1k” and “50k” are lexicon sizes.
- The “Full” lexicon contains all per-image lexicon words.
- “None” means recognition without a lexicon.
- Without a lexicon, the model takes less than 2 ms to recognize an image.
- With a lexicon, recognition speed depends on the lexicon size.
- We adopt the precise search (Sec. 3.4) when the lexicon size is ≤ 1k.
- On larger lexicons, we adopt the approximate beam search (Sec. 3.4) with a beam width of 7.
- With a 50k-word lexicon, the search takes ~200 ms per image.
- For each image, there is a 50-word lexicon and a 1000-word lexicon.
- All lexicons consist of a ground-truth word and some randomly picked words.
- Each sample is associated with a 50-word lexicon.
- ICDAR 2003 [24] (IC03) contains 860 cropped word images, each associated with a 50-word lexicon defined by Wang et al. [35].
- Besides, there is a “full lexicon” which contains all lexicon words, and the Hunspell [1] lexicon which has 50k words.
- On unconstrained recognition tasks (recognizing without a lexicon), our model outperforms all the other methods in comparison.
- On constrained recognition tasks (recognizing with a lexicon), RARE achieves state-of-the-art or highly competitive accuracies.
- Each image is associated with a 50-word lexicon, which is inherited from the SVT [35] dataset.
- In addition, there is a “Full” lexicon which contains all the per-image lexicon words.
- “50” and “Full” represent recognition with the 50-word lexicons and the full lexicon, respectively.
- “None” represents recognition without a lexicon.
- In the second and third columns, we compare recognition accuracies with the 50-word lexicon and the full lexicon.
- Our method outperforms [29], a perspective text recognition method, by a large margin on both lexicons.
- In the comparisons with [32], which uses the same training set as RARE, we still observe significant improvements in both the full-lexicon and the lexicon-free settings.
- All models are evaluated without a lexicon.
|
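The precise search scores every lexicon word, while the approximate scheme walks a prefix tree with a beam. A minimal Python sketch of the idea, assuming a toy per-character scorer — `char_logprob` here is a hypothetical stand-in for the SRN's character posteriors, not the paper's actual scoring function:

```python
# Prefix-tree (trie) beam search over a lexicon, as sketched above.
import math

def build_trie(words):
    """Nested dicts; the key '$' marks end-of-word (playing the role of EOS)."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = {}
    return root

def beam_search(trie, char_logprob, beam_width=7):
    """Expand the tree level by level, keeping the beam_width best-scoring prefixes."""
    beams = [("", trie, 0.0)]          # (prefix, trie node, cumulative log-prob)
    best = ("", -math.inf)
    while beams:
        candidates = []
        for prefix, node, score in beams:
            for ch, child in node.items():
                if ch == "$":          # reached a complete lexicon word
                    if score > best[1]:
                        best = (prefix, score)
                else:
                    candidates.append(
                        (prefix + ch, child, score + char_logprob(len(prefix), ch)))
        candidates.sort(key=lambda b: b[2], reverse=True)
        beams = candidates[:beam_width]
    return best[0]

# Toy scorer that prefers the word "tea".
def char_logprob(pos, ch):
    return 0.0 if pos < 3 and "tea"[pos] == ch else -2.0

print(beam_search(build_trie(["ten", "tea", "to"]), char_logprob))  # -> tea
```

Because the loop depth is bounded by the longest lexicon word, the cost is far below iterating the full 50k-word lexicon, matching the paper's observation.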
2 | STN (25) | [!≈ es ti: en] |
- RARE is a specially designed deep neural network, which consists of a Spatial Transformer Network (STN) and a Sequence Recognition Network (SRN).
- Figure 1. Schematic overview of RARE, which consists of a spatial transformer network (STN) and a sequence recognition network (SRN).
- The STN transforms an input image to a rectified image, while the SRN recognizes text.
- Specifically, we construct a deep neural network that combines a Spatial Transformer Network [18] (STN) and a Sequence Recognition Network (SRN).
- In the STN, an input image is spatially transformed into a rectified image.
- Ideally, the STN produces an image that contains regular text, which is a more appropriate input for the SRN than the original one.
- Consequently, for the STN, we do not need to label any geometric ground truth, i.e. the positions of the TPS fiducial points, but let its training be supervised by the error differentials back-propagated by the SRN.
- In practice, the training eventually makes the STN tend to produce images that contain regular text, which are desirable inputs for the SRN.
- Second, our model extends the STN framework [18] with an attention-based model.
- The original STN is only tested on plain convolutional neural networks.
- Moreover, it does not require extra annotations for the rectification process, since the STN is supervised by the SRN during training.
- The STN transforms an input image I to a rectified image $I^\prime$ with a predicted TPS transformation.
- A distinctive property of the STN is that its sampler is differentiable.
- Therefore, once we have a differentiable localization network and a differentiable grid generator, the STN can back-propagate error differentials and get trained.
- Structure of the STN. The localization network localizes a set of fiducial points C, with which the grid generator generates a sampling grid P. The sampler produces a rectified image $I^\prime$, given I and P.
- Instead, the training of the localization network is completely supervised by the gradients propagated by the other parts of the STN, following the back-propagation algorithm [22].
- The STN is able to rectify images that contain these types of irregular text, making them more readable for the following recognizer.
- Figure 4. The STN rectifies images that contain several types of irregular text.
- The STN can deal with several types of irregular text, including (a) loosely-bounded text; (b) multi-oriented text; (c) perspective text; (d) curved text.
- where the probability $p(\cdot)$ is computed by Eq. 8, and $\theta$ denotes the parameters of both the STN and the SRN.
- Spatial Transformer Network. The localization network of the STN has 4 convolutional layers, each followed by a $2 \times 2$ max-pooling layer.
- The output size of the STN is also $100 \times 32$.
- Fiducial points predicted by the STN are plotted on the input images as green crosses.
- We see that the STN tends to place fiducial points along the upper and lower edges of scene text, and
- Generally, the rectification made by the STN is not perfect, but it alleviates the recognition difficulty to some extent.
|
3 | fiducial (24) | [fɪ'dju:ʃjəl] |
- The TPS transformation is configured by a set of fiducial points, whose coordinates are regressed by a convolutional neural network.
- Consequently, for the STN, we do not need to label any geometric ground truth, i.e. the positions of the TPS fiducial points, but let its training be supervised by the error differentials back-propagated by the SRN.
- As illustrated in fig. 2, it first predicts a set of fiducial points via its localization network.
- Then, inside the grid generator, it calculates the TPS transformation parameters from the fiducial points, and generates a sampling grid on I.
- Structure of the STN. The localization network localizes a set of fiducial points C, with which the grid generator generates a sampling grid P. The sampler produces a rectified image $I^\prime$, given I and P.
- The localization network localizes K fiducial points by directly regressing their $x,y$-coordinates.
- The coordinates are denoted by $C=[c_1,\cdots, c_K] \in \Re^{2 \times K}$, whose k-th column $c_k=[x_k,y_k]^T$ contains the coordinates of the k-th fiducial point.
- The network localizes fiducial points based on global image contexts.
- It is expected to capture the overall text shape of an input image, and localize fiducial points accordingly.
- It should be emphasized that we do not annotate coordinates of fiducial points for any sample.
- We first define another set of fiducial points, called the base fiducial points, denoted by $C^\prime=[c_1^\prime,\cdots, c_K^\prime] \in \Re^{2 \times K}$.
- As illustrated in fig. 3, the base fiducial points are evenly distributed along the top and bottom edges of a rectified image $I^\prime$.
- Figure 3. Fiducial points and the TPS transformation.
- Green markers on the left image are the fiducial points C.
- Cyan markers on the right image are the base fiducial points $C^\prime$.
- where $d_{i,k}$ is the Euclidean distance between $p^\prime_i$ and the $k$-th base fiducial point $c^\prime_k$.
- Green markers are the predicted fiducial points on the input images.
- Figure 6. Some initialization patterns for the fiducial points.
- The initial biases are set to values that yield the fiducial-point pattern displayed in fig. 6.a.
- We set the number of fiducial points to $K=20$, meaning that the localization network outputs a 40-dimensional vector.
- The left column is the input images, where green crosses are the predicted fiducial points.
- Fiducial points predicted by the STN are plotted on the input images as green crosses.
- We see that the STN tends to place fiducial points along the upper and lower edges of scene text, and
|
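The base fiducial points $C^\prime$ described above can be generated programmatically: half of the K points are spread evenly along the top edge of the rectified image and half along the bottom. A numpy sketch in normalized coordinates, where the edge margins are illustrative assumptions rather than the paper's exact constants:

```python
import numpy as np

def base_fiducial_points(K=20, margin_x=0.05, margin_y=0.05):
    """K/2 points evenly spaced along the top edge and K/2 along the bottom,
    in normalized [0, 1] x [0, 1] coordinates. Margins are illustrative."""
    xs = np.linspace(margin_x, 1.0 - margin_x, K // 2)
    top = np.stack([xs, np.full(K // 2, margin_y)], axis=1)
    bottom = np.stack([xs, np.full(K // 2, 1.0 - margin_y)], axis=1)
    return np.concatenate([top, bottom], axis=0)  # shape (K, 2)
```

With $K=20$, flattening the result gives exactly the 40-dimensional target the localization network regresses.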
4 | SRN (20) | [!≈ es ɑ:(r) en] |
- RARE is a specially designed deep neural network, which consists of a Spatial Transformer Network (STN) and a Sequence Recognition Network (SRN).
- In testing, an image is first rectified via a predicted Thin-Plate-Spline (TPS) transformation into a more “readable” image for the following SRN, which recognizes text through a sequence recognition approach.
- Figure 1. Schematic overview of RARE, which consists of a spatial transformer network (STN) and a sequence recognition network (SRN).
- The STN transforms an input image to a rectified image, while the SRN recognizes text.
- Specifically, we construct a deep neural network that combines a Spatial Transformer Network [18] (STN) and a Sequence Recognition Network (SRN).
- Ideally, the STN produces an image that contains regular text, which is a more appropriate input for the SRN than the original one.
- Motivated by this, for the SRN we construct an attention-based model [4] that recognizes text in a sequence recognition approach.
- The SRN consists of an encoder and a decoder.
- Consequently, for the STN, we do not need to label any geometric ground truth, i.e. the positions of the TPS fiducial points, but let its training be supervised by the error differentials back-propagated by the SRN.
- In practice, the training eventually makes the STN tend to produce images that contain regular text, which are desirable inputs for the SRN.
- Third, our model adopts a convolutional-recurrent structure in the encoder of the SRN, and is thus a novel variant of the attention-based model [4].
- Moreover, it does not require extra annotations for the rectification process, since the STN is supervised by the SRN during training.
- The input to the SRN is a rectified image $I^\prime$, which ideally contains a word that is written horizontally from left to right.
- In our model, the SRN is an attention-based model [4, 8], which directly recognizes a sequence from an input image.
- Structure of the SRN, which consists of an encoder and a decoder. The encoder uses several convolutional layers (ConvNet) and a two-layer BLSTM network to extract a sequential representation (h) for the input image.
- The SRN directly maps an input sequence to another sequence.
- where the probability $p(\cdot)$ is computed by Eq. 8, and $\theta$ denotes the parameters of both the STN and the SRN.
- Sequence Recognition Network. In the SRN, the encoder has 7 convolutional layers, whose {filter size, number of filters, stride, padding size} are respectively {3,64,1,1}, {3,128,1,1}, {3,256,1,1}, {3,256,1,1}, {3,512,1,1}, {3,512,1,1}, and {2,512,1,0}.
- We also train and test a model that contains only the SRN.
|
5 | rectify (19) | [ˈrektɪfaɪ] |
- In testing, an image is first rectified via a predicted Thin-Plate-Spline (TPS) transformation into a more “readable” image for the following SRN, which recognizes text through a sequence recognition approach.
- The STN transforms an input image to a rectified image, while the SRN recognizes text.
- This motivates us to apply a spatial transformation prior to recognition, in order to rectify input images into ones that are more “readable” by recognizers.
- In the STN, an input image is spatially transformed into a rectified image.
- The transformation is a thin-plate-spline [6] (TPS) transformation, whose nonlinearity allows us to rectify various types of irregular text, including perspective and curved text.
- Phan et al. propose to explicitly rectify perspective distortions via SIFT [23] descriptor matching.
- Our method rectifies several types of irregular text in a unified way.
- The STN transforms an input image I to a rectified image $I^\prime$ with a predicted TPS transformation.
- The sampler takes both the grid and the input image; it produces a rectified image $I^\prime$ by sampling on the grid points.
- Structure of the STN. The localization network localizes a set of fiducial points C, with which the grid generator generates a sampling grid P. The sampler produces a rectified image $I^\prime$, given I and P.
- As illustrated in fig. 3, the base fiducial points are evenly distributed along the top and bottom edges of a rectified image $I^\prime$.
- The grid of pixels on a rectified image $I^\prime$ is denoted by $P^\prime = \{p_i^\prime\}_{i=1,\cdots,N}$, where $p_i^\prime = [x_i^\prime, y_i^\prime]^T$ is the x,y coordinates of the i-th pixel, and N is the number of pixels.
- By setting all pixel values, we get the rectified image $I^\prime$:
- The flexibility of the TPS transformation allows us to transform irregular text images into rectified images that contain regular text.
- The STN is able to rectify images that contain these types of irregular text, making them more readable for the following recognizer.
- The input to the SRN is a rectified image $I^\prime$, which ideally contains a word that is written horizontally from left to right.
- Figure 4. The STN rectifies images that contain several types of irregular text.
- The middle column is the rectified images (we use gray-scale images for recognition).
- Our model rectifies images that contain curved text before recognizing them.
|
6 | rectification (11) | [ˌrektɪfɪ'keɪʃn] |
- Robust Scene Text Recognition with Automatic Rectification
- Different from those in documents, words in natural images often possess irregular shapes, which are caused by perspective distortion, curved character placement, etc. We propose RARE (Robust text recognizer with Automatic REctification), a recognition model that is robust to irregular text.
- Zhang et al. [42] propose a character rectification method that leverages the low-rank structures of text.
- Moreover, it does not require extra annotations for the rectification process, since the STN is supervised by the SRN during training.
- To validate the effectiveness of the rectification scheme, we evaluate RARE on the task of perspective text recognition.
- Our rectification scheme can significantly alleviate this problem.
- Figure 9. Examples showing the rectifications our model makes and the recognition results.
- In Fig. 9, we demonstrate the effect of rectification through some examples.
- Generally, the rectification made by the STN is not perfect, but it alleviates the recognition difficulty to some extent.
- Traditional solutions typically use a separate text rectification component.
- The extensive experimental results show that 1) without geometric supervision, the learned model can automatically generate more “readable” images for both humans and the sequence recognition network; 2) the proposed text rectification method can significantly improve recognition accuracies on irregular scene text; and 3) the proposed scene text recognition system is competitive with the state of the art.
|
7 | TPS (10) | [!≈ ti: pi: es] |
- In testing, an image is first rectified via a predicted Thin-Plate-Spline (TPS) transformation into a more “readable” image for the following SRN, which recognizes text through a sequence recognition approach.
- The transformation is a thin-plate-spline [6] (TPS) transformation, whose nonlinearity allows us to rectify various types of irregular text, including perspective and curved text.
- The TPS transformation is configured by a set of fiducial points, whose coordinates are regressed by a convolutional neural network.
- Consequently, for the STN, we do not need to label any geometric ground truth, i.e. the positions of the TPS fiducial points, but let its training be supervised by the error differentials back-propagated by the SRN.
- The STN transforms an input image I to a rectified image $I^\prime$ with a predicted TPS transformation.
- Then, inside the grid generator, it calculates the TPS transformation parameters from the fiducial points, and generates a sampling grid on I.
- The grid generator estimates the TPS transformation parameters, and generates a sampling grid.
- Figure 3. Fiducial points and the TPS transformation.
- The parameters of the TPS transformation are represented by a matrix $T \in \Re^{2 \times (K+3)}$, which is computed by
- The flexibility of the TPS transformation allows us to transform irregular text images into rectified images that contain regular text.
|
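The role of the $2 \times (K+3)$ matrix T can be sketched as follows: each rectified-image point $p^\prime=(x^\prime, y^\prime)$ is lifted to $[1, x^\prime, y^\prime, r_1, \cdots, r_K]^T$, where $r_k = d_k^2 \ln d_k^2$ is the standard thin-plate-spline kernel evaluated at the distance to the k-th base fiducial point, and then multiplied by T to get the sampling location on the input image. A hedged numpy sketch (the small epsilon guarding $\ln 0$ is our addition, not from the paper):

```python
import numpy as np

def tps_map(T, p_prime, base_points):
    """Map one rectified-image point p' = (x', y') to input-image coordinates.
    T has shape (2, K + 3); base_points has shape (K, 2)."""
    d2 = np.sum((base_points - p_prime) ** 2, axis=1)      # squared distances d_k^2
    r = np.where(d2 > 0, d2 * np.log(d2 + 1e-12), 0.0)     # TPS kernel r_k = d^2 ln d^2
    phi = np.concatenate([[1.0], p_prime, r])              # lifted vector, shape (K + 3,)
    return T @ phi                                         # (x, y) on the input image
```

Applying `tps_map` to every pixel of the rectified grid $P^\prime$ yields the sampling grid P that the sampler consumes.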
8 | sequential (8) | [sɪˈkwenʃl] |
- It bears some resemblance to a sequential signal.
- Given an input image, the encoder generates a sequential feature representation, which is a sequence of feature vectors.
- Su and Lu [34] extract a sequential image representation, which is a sequence of HOG [10] descriptors, and predict the corresponding character sequence with a recurrent neural network (RNN).
- We extract a sequential representation from $I^\prime$, and recognize a word from it.
- The encoder extracts a sequential representation from the input image $I^\prime$.
- The decoder recurrently generates a sequence conditioned on the sequential representation, by decoding the relevant contents it attends to at each step.
- A naïve approach to extracting a sequential representation of $I^\prime$ is to take local image patches from left to right, and describe each of them with a CNN.
- Structure of the SRN, which consists of an encoder and a decoder. The encoder uses several convolutional layers (ConvNet) and a two-layer BLSTM network to extract a sequential representation (h) for the input image.
|
9 | SVT-Perspective (8) | |
- SVT-Perspective [29] is specifically designed for evaluating the performance of perspective text recognition algorithms.
- Text samples in SVT-Perspective are picked from side-view angles in Google Street View, thus most of them are heavily deformed by perspective distortion.
- SVT-Perspective consists of 639 cropped images for testing.
- Samples are taken from the SVT-Perspective [29] dataset; b) Curved text. Samples are taken from the CUTE80 [30] dataset.
- For comparison, we test the CRNN model [32] on SVT-Perspective.
- Table 2. Recognition accuracies on SVT-Perspective [29].
- The reason is that the SVT-Perspective dataset mainly consists of perspective text, which is inappropriate for direct recognition.
- The first five rows are taken from SVT-Perspective [29], and the remaining rows are taken from CUTE80 [30].
|
10 | transformer (7) | [trænsˈfɔ:mə(r)] |
- RARE is a specially designed deep neural network, which consists of a Spatial Transformer Network (STN) and a Sequence Recognition Network (SRN).
- Figure 1. Schematic overview of RARE, which consists of a spatial transformer network (STN) and a sequence recognition network (SRN).
- Specifically, we construct a deep neural network that combines a Spatial Transformer Network [18] (STN) and a Sequence Recognition Network (SRN).
- 3.1. Spatial Transformer Network
- Spatial Transformer Network. The localization network of the STN has 4 convolutional layers, each followed by a $2 \times 2$ max-pooling layer.
- We address this problem in a more feasible and elegant way by adopting a differentiable spatial transformer network module.
- In addition, the spatial transformer network is connected to an attention-based sequence recognizer, allowing us to train the whole model end-to-end.
|
11 | recurrent (7) | [rɪˈkʌrənt] |
- Su and Lu [34] extract a sequential image representation, which is a sequence of HOG [10] descriptors, and predict the corresponding character sequence with a recurrent neural network (RNN).
- Instead, following [32], we build a network that combines convolutional layers and recurrent networks.
- The BLSTM is a recurrent network that can analyze the dependencies within a sequence in both directions; it outputs another sequence with the same length as the input one.
- 3.2.2 Decoder: Recurrent Character Generator
- It is a recurrent neural network with the attention structure proposed in [4, 8].
- In the recurrent part, we adopt the Gated Recurrent Unit (GRU) [7] as the cell.
- The state is updated from $s_{t-1}$ via the recurrent process of GRU [7, 8]:
|
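The GRU state update referenced above follows the standard gate equations: an update gate z, a reset gate r, and a candidate state blended into the previous state. A minimal numpy sketch — the weight names and the omission of bias terms are simplifications for illustration, not the paper's exact parameterization:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(s_prev, x, W):
    """One GRU update producing the new state s_t from s_{t-1} and input x.
    W is a dict of weight matrices; biases are omitted for brevity."""
    z = sigmoid(W["Wz"] @ x + W["Uz"] @ s_prev)              # update gate
    r = sigmoid(W["Wr"] @ x + W["Ur"] @ s_prev)              # reset gate
    s_tilde = np.tanh(W["Wh"] @ x + W["Uh"] @ (r * s_prev))  # candidate state
    return (1.0 - z) * s_prev + z * s_tilde                  # new state s_t
```

In the decoder, x at step t would be the concatenation of the previous output embedding and the attention glimpse; here it is just a generic input vector.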
12 | recognizer (6) | ['rekəgnaɪzə] |
- Different from those in documents, words in natural images often possess irregular shapes, which are caused by perspective distortion, curved character placement, etc. We propose RARE (Robust text recognizer with Automatic REctification), a recognition model that is robust to irregular text.
- Usually, a text recognizer works best when its input images contain tightly-bounded regular text.
- This motivates us to apply a spatial transformation prior to recognition, in order to rectify input images into ones that are more “readable” by recognizers.
- The STN is able to rectify images that contain these types of irregular text, making them more readable for the following recognizer.
- From Tab. 1, we see that the SRN-only model is also a very competitive recognizer, achieving higher or competitive performance on most of the benchmarks.
- In addition, the spatial transformer network is connected to an attention-based sequence recognizer, allowing us to train the whole model end-to-end.
|
13 | sampler (6) | [ˈsɑ:mplə(r)] |
- The sampler takes both the grid and the input image; it produces a rectified image $I^\prime$ by sampling on the grid points.
- A distinctive property of the STN is that its sampler is differentiable.
- Structure of the STN. The localization network localizes a set of fiducial points C, with which the grid generator generates a sampling grid P. The sampler produces a rectified image $I^\prime$, given I and P.
- 3.1.3 Sampler
- Lastly, in the sampler, the pixel value of $p^\prime_i$ is bilinearly interpolated from the pixels near $p_i$ on the input image.
- where $V$ represents the bilinear sampler [18], which is also a differentiable module.
|
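The bilinear interpolation performed by the sampler can be sketched for a single grayscale pixel: the value at a continuous location is a weighted blend of its four integer neighbors. This illustrative helper is not the paper's implementation (which samples a whole grid in one differentiable operation), just the per-pixel arithmetic:

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Bilinearly interpolate grayscale `img` at continuous coordinates (x, y),
    mirroring the per-pixel behavior of the differentiable sampler V."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, img.shape[1] - 1)          # clamp to the image border
    y1 = min(y0 + 1, img.shape[0] - 1)
    wx, wy = x - x0, y - y0                     # fractional interpolation weights
    top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]
    bottom = (1 - wx) * img[y1, x0] + wx * img[y1, x1]
    return (1 - wy) * top + wy * bottom
```

Because the output is a smooth function of (x, y) and of the pixel values, gradients can flow back through both, which is exactly why the STN is end-to-end trainable.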
14 | differentiable (6) | [ˌdɪfə'renʃɪəbl] |
- A distinctive property of the STN is that its sampler is differentiable.
- Therefore, once we have a differentiable localization network and a differentiable grid generator, the STN can back-propagate error differentials and get trained.
- The grid generator can back-propagate gradients, since its two matrix multiplications, Eq. 1 and Eq. 4, are both differentiable.
- where $V$ represents the bilinear sampler [18], which is also a differentiable module.
- We address this problem in a more feasible and elegant way by adopting a differentiable spatial transformer network module.
|
15 | EOS (5) | |
- The decoder generates a character sequence (including the EOS token) conditioned on h.解码器生成以h为条件的字符序列(包括EOS令牌)。
- The label space includes all English alphanumeric characters, plus a special “end-of-sequence” (EOS) token, which ends the generation process.标签空间包括所有英文字母数字字符,以及一个特殊的“序列结束”(EOS)令牌,用于结束生成过程。
- A prefix tree of three words: “ten”, “tea”, and “to”. $\epsilon$ and $\Omega$ are the tree root and the EOS token respectively.三个单词“ten”、“tea”和“to”的前缀树。$\epsilon$和$\Omega$分别是树根和EOS令牌。
- Nodes on a path from the root to a leaf forms a word (including the EOS).从根到叶子的路径上的节点形成一个单词(包括EOS)。
- For the decoder, we use a GRU cell that has 256 memory blocks and 37 output units (26 letters, 10 digits, and 1 EOS token).对于解码器,我们使用具有256个存储器块和37个输出单元(26个字母,10个数字和1个EOS令牌)的GRU单元。
|
16 | Tab (5) | [tæb] |
- In Tab. 1 we report our results, and compare them with other methods.在表1中,我们报告实验结果,并与其他方法进行比较。
- As reported in the last row of Tab.正如Tab的最后一行所报道的那样。
- Tab. 2 summarizes the results.表 2总结了结果。
- Furthermore, recall the results in Tab. 1: on SVT-Perspective, RARE outperforms [32] by an even larger margin.此外,回顾表1中的结果,在SVT-Perspective上,RARE优于[32]的幅度更大。
- From the results summarized in Tab. 3, we see that RARE outperforms the other two methods by a large margin.从表3中总结的结果,我们看到RARE的性能远远超过其他两种方法。
|
17 | readable (4) | [ˈri:dəbl] |
- In testing, an image is firstly rectified via a predicted Thin-Plate-Spline (TPS) transformation, into a more “readable” image for the following SRN, which recognizes text through a sequence recognition approach.在测试中,图像首先通过预测的薄板样条(TPS)变换被矫正为对后续SRN更“可读”的图像,SRN再通过序列识别方法识别文本。
- This motivates us to apply a spatial transformation prior to recognition, in order to rectify input images into ones that are more “readable” by recognizers.这促使我们在识别之前应用空间变换,以便将输入图像校正为识别器更“可读”的图像。
- The STN is able to rectify images that contain these types of irregular text, making them more readable for the following recognizer.STN能够纠正包含这些类型的不规则文本的图像,使其对于以下识别器更具可读性。
- The extensive experimental results show that 1) without geometric supervision, the learned model can automatically generate more “readable” images for both human and the sequence recognition network; 2) the proposed text rectification method can significantly improve recognition accuracies on irregular scene text; 3) the proposed scene text recognition system is competitive compared with the state-of-the-arts.大量的实验结果表明,1)在没有几何监督的情况下,学习模型可以自动为人类和序列识别网络生成更“可读”的图像;2)提出的文本校正方法可以显著提高不规则场景文本的识别准确率;3)与现有技术相比,提出的场景文本识别系统具有竞争力。
|
18 | i.e. (4) | [ˌaɪ ˈi:] |
- Consequently, for the STN, we do not need to label any geometric ground truth, i.e. the positions of the TPS fiducial points, but let its training be supervised by the error differentials back-propagated by the SRN.因此,对于STN,我们不需要标注任何几何真值,即TPS基准点的位置,而是让其训练由SRN反向传播的误差微分来监督。
- According to the translation invariance property of CNN, each vector corresponds to a local image region, i.e. receptive field, and is a descriptor for that region.根据CNN的平移不变性,每个向量对应一个局部图像区域,即感受野,并且是该区域的描述符。
- where $l_{t-1}$ is the $t-1$-th ground-truth label in training, while in testing, it is the label predicted in the previous step, i.e. $\hat l_{t-1}$ .其中$l_{t-1}$是训练中第$t-1$个真实标签,而在测试中,它是上一步中预测的标签,即$\hat l_{t-1}$。
- When a test image is associated with a lexicon, i.e. a set of words for selection, the recognition process is to pick the word with the highest posterior conditional probability:当测试图像与词典相关联时,即一组用于选择的单词时,识别过程是选择具有最高后验条件概率的单词:
|
19 | descriptor (4) | [dɪˈskrɪptə(r)] |
- Su and Lu [34] extract sequential image representation, which is a sequence of HOG [10] descriptors, and predict the corresponding character sequence with a recurrent neural network (RNN).Su和Lu [34]提取序列图像表示,它是HOG [10]描述符的序列,并用递归神经网络(RNN)预测相应的字符序列。
- Yao et al. [38] firstly propose the multi-oriented text detection problem, and deal with it by carefully designing rotation-invariant region descriptors.姚等人[38]首先提出了多方向文本检测问题,并通过仔细设计旋转不变区域描述符来处理它。
- Phan et al. propose to explicitly rectify perspective distortions via SIFT [23] descriptor matching.潘等人建议通过SIFT [23]描述符匹配明确纠正透视失真。
- According to the translation invariance property of CNN, each vector corresponds to a local image region, i.e. receptive field, and is a descriptor for that region.根据CNN的平移不变性,每个向量对应一个局部图像区域,即感受野,并且是该区域的描述符。
|
20 | Eq (4) | |
- The grid generator can back-propagate gradients, since its two matrix multiplications, Eq. 1 and Eq. 4, are both differentiable.网格生成器可以反向传播梯度,因为它的两个矩阵乘法(公式1和公式4)都是可微分的。
- The grid generator can back-propagate gradients, since its two matrix multiplications, Eq. 1 and Eq. 4, are both differentiable.网格生成器可以反向传播梯度,因为它的两个矩阵乘法(公式1和公式4)都是可微分的。
- where the probability $p(\cdot)$ is computed by Eq. 8, $\theta$ is the parameters of both STN and SRN.其中概率$p(\cdot)$由方程式8计算,$\theta$是STN和SRN的参数。
- However, on very large lexicons, e.g. Hunspell [1], which contains more than 50k words, computing Eq. 10 is time-consuming, as it requires iterating over all lexicon words.但是,在非常大的词典上,例如包含超过5万个单词的Hunspell [1],计算公式10非常耗时,因为它需要遍历所有词典单词。
|
21 | BLSTM (4) | [!≈ bi: el es ti: em] |
- Structure of the SRN, which consists of an encoder and a decoder. The encoder uses several convolution layers (ConvNet) and a two-layer BLSTM network to extract a sequential representation (h) for the input image.编码器使用几个卷积层(ConvNet)和两层BLSTM网络来提取输入图像的顺序表示(h)。
- We further apply a two-layer Bidirectional Long-Short Term Memory (BLSTM) [14, 13] network to the sequence, in order to model the long-term dependencies within the sequence.我们进一步将两层双向长短期记忆(BLSTM)[14,13]网络应用于序列,以模拟序列内的长期依赖性。
- The BLSTM is a recurrent network that can analyze the dependencies within a sequence in both directions; it outputs another sequence which has the same length as the input one.BLSTM是一个循环网络,可以在两个方向上分析序列内的依赖关系;它输出另一个与输入序列等长的序列。
- On the top of the convolutional layers is a two-layer BLSTM network, each LSTM has 256 hidden units.在卷积层的顶部是两层BLSTM网络,每个LSTM具有256个隐藏单元。
|
22 | GRU (4) | [!≈ dʒi: ɑ:(r) ju:] |
- In the recurrency part, we adopt the Gated Recurrent Unit (GRU) [7] as the cell.在循环部分,我们采用门控循环单元(GRU)[7]作为单元。
- where $s_{t-1}$ is the state variable of the GRU cell at the last step.其中$s_{t-1}$是最后一步GRU单元的状态变量。
- The state $s_{t-1}$ is updated via the recurrent process of GRU [7, 8]:状态$s_{t-1}$通过GRU [7,8]的循环过程更新:
- For the decoder, we use a GRU cell that has 256 memory blocks and 37 output units (26 letters, 10 digits, and 1 EOS token).对于解码器,我们使用具有256个存储器块和37个输出单元(26个字母,10个数字和1个EOS令牌)的GRU单元。
|
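The GRU state update these sentences refer to can be sketched for the scalar case. The gate formulation below is the common z/r/candidate form, and the weights are illustrative assumptions, not the paper's trained parameters:

```python
import math

def gru_step(x, s, w):
    """One GRU update for scalar input x and previous state s.
    w holds six illustrative scalar weights; biases are omitted for brevity."""
    sig = lambda a: 1.0 / (1.0 + math.exp(-a))
    z = sig(w['wz'] * x + w['uz'] * s)               # update gate
    r = sig(w['wr'] * x + w['ur'] * s)               # reset gate
    cand = math.tanh(w['wh'] * x + w['uh'] * r * s)  # candidate state
    return (1.0 - z) * s + z * cand                  # interpolate old and new state

w = dict(wz=1.0, uz=1.0, wr=1.0, ur=1.0, wh=1.0, uh=1.0)
s = 0.0
for x in [1.0, -0.5, 0.2]:   # run a tiny input sequence through the cell
    s = gru_step(x, s, w)
print(round(s, 4))
```

Because the new state is a convex combination of the old state and a tanh-bounded candidate, the state stays in (-1, 1) here, which illustrates why gated cells are numerically stable over long sequences.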
23 | posterior (4) | [pɒˈstɪəriə(r)] |
- At each step the posterior probabilities of all child nodes are computed.在每个步骤中,计算所有子节点的后验概率。
- Numbers on the edges are the posterior probabilities.边缘上的数字是后验概率。
- When a test image is associated with a lexicon, i.e. a set of words for selection, the recognition process is to pick the word with the highest posterior conditional probability:当测试图像与词典相关联时,即一组用于选择的单词时,识别过程是选择具有最高后验条件概率的单词:
- In testing, we start from the root node, every time the model outputs a distribution $\hat y_t$ , the child node with the highest posterior probability is selected as the next node to move to.在测试中,我们从根节点开始,每次模型输出分布$\hat y_t$时,选择具有最高后验概率的子节点作为要移动到的下一个节点。
|
24 | IIIT5K (4) | |
- • IIIT 5K-Words [25] (IIIT5K) contains 3000 cropped word images for testing.•IIIT 5K-Words [25](IIIT5K)包含3000个用于测试的裁剪单词图像。
- On IIIT5K, RARE outperforms prior art CRNN [32] by nearly 4 percentage points, indicating a clear improvement in performance.在IIIT5K上,RARE的性能比现有技术CRNN [32]高出近4个百分点,表明性能明显提高。
- We observe that IIIT5K contains a lot of irregular text, especially curved text, while RARE has an advantage in dealing with irregular text.我们观察到IIIT5K包含大量不规则文本,尤其是弯曲文本,而RARE在处理不规则文本方面具有优势。
- On IIIT5K, SVT and IC03, constrained recognition accuracies are on par with [17], and slightly lower than [32].在IIIT5K,SVT和IC03上,约束识别精度与[17]相当,略低于[32]。
|
25 | SVT (4) | [!≈ es vi: ti:] |
- • Street View Text [35] (SVT) is collected from Google Street View.•街景文字[35](SVT)是从Google街景中收集的。
- Many images in SVT are severely corrupted by noise and blur, or have very low resolutions. SVT中的许多图像受到噪声和模糊的严重破坏,或者具有非常低的分辨率。
- On IIIT5K, SVT and IC03, constrained recognition accuracies are on par with [17], and slightly lower than [32].在IIIT5K,SVT和IC03上,约束识别精度与[17]相当,略低于[32]。
- Each image is associated with a 50-word lexicon, which is inherited from the SVT [35] dataset.每个图像都与一个50字的词典相关联,该词典继承自SVT [35]数据集。
|
26 | IC03 (4) | |
- • ICDAR 2003 [24] (IC03) contains 860 cropped word images, each associated with a 50-word lexicon defined by Wang et al. [35].•ICDAR 2003 [24](IC03)包含860个裁剪的单词图像,每个图像与Wang等人[35]定义的50个单词的词典相关联。
- • ICDAR 2013 [20] (IC13) inherits most of its samples from IC03.•ICDAR 2013 [20](IC13)继承了IC03的大部分样本。
- After filtering samples as done in IC03, the dataset contains 857 samples.按照IC03的做法过滤样本后,数据集包含857个样本。
- On IIIT5K, SVT and IC03, constrained recognition accuracies are on par with [17], and slightly lower than [32].在IIIT5K,SVT和IC03上,约束识别精度与[17]相当,略低于[32]。
|
27 | CUTE80 (4) | |
- Samples are taken from the SVT-Perspective [29] dataset; b) Curved text. Samples are taken from the CUTE80 [30] dataset.样本取自SVT-Perspective [29]数据集;b)弯曲文本,样本取自CUTE80 [30]数据集。
- The first five rows are taken from SVT-Perspective [29]; the remaining rows are taken from CUTE80 [30].前五行取自SVT-Perspective [29],其余各行取自CUTE80 [30]。
- CUTE80 [30] focuses on the recognition of curved text.CUTE80 [30]专注于弯曲文本的识别。
- Table 3. Recognition accuracies on CUTE80 [29].表3.CUTE80上的识别准确度[29]。
|
28 | e.g. (3) | [ˌi: ˈdʒi:] |
- In natural scenes, text appears on various kinds of objects, e.g. road signs, billboards, and product packaging.在自然场景中,文本出现在各种对象上,例如道路标志,广告牌和产品包装。
- However, on very large lexicons, e.g. Hunspell [1], which contains more than 50k words, computing Eq. 10 is time-consuming, as it requires iterating over all lexicon words.但是,在非常大的词典上,例如包含超过5万个单词的Hunspell [1],计算公式10非常耗时,因为它需要遍历所有词典单词。
- In the future, we plan to address the end-to-end scene text reading problem through the combination of RARE with a scene text detection method, e.g. [43].未来,我们计划通过将RARE与场景文本检测方法相结合来解决端到端场景文本阅读问题,例如[43]。
|
29 | recurrently (3) | [rɪ'kʌrəntlɪ] |
- The decoder recurrently generates a character sequence conditioning on the input sequence, by decoding the relevant contents which are determined by its attention mechanism at each step.解码器通过在每一步解码由其注意力机制确定的相关内容,以输入序列为条件循环地生成字符序列。
- The decoder recurrently generates a sequence conditioned on the sequential representation, by decoding the relevant contents it attends to at each step.解码器通过解码在每个步骤中所关注的相关内容,循环地生成以顺序表示为条件的序列。
- The decoder recurrently generates a sequence of characters, conditioned on the sequence produced by the encoder.解码器以编码器产生的序列为条件,反复生成一系列字符。
|
30 | leverage (3) | [ˈli:vərɪdʒ] |
- Zhang et al. [42] propose a character rectification method that leverages the low-rank structures of text.张等人 [42]提出了一种利用文本的低等级结构的字符整理方法。
- Besides, the spatial dependencies between the patches are not exploited and leveraged.此外,图像块之间的空间依赖性没有得到挖掘和利用。
- Restricted by the sizes of the receptive fields, the feature sequence leverages limited image contexts.受感受野大小的限制,特征序列只能利用有限的图像上下文。
|
31 | marker (3) | [ˈmɑ:kə(r)] |
- Green markers on the left image are the fiducial points C.左侧图像上的绿色标记是C点。
- Cyan markers on the right image are the base fiducial points $C^\prime$.右侧图像上的青色标记是基本基准点$C^\prime$。
- Green markers are the predicted fiducial points on the input images.绿色标记是输入图像上预测的基准点。
|
32 | iterate (3) | [ˈɪtəreɪt] |
- By iterating over all points in $P^\prime$ , we generate a grid $P=\{p_i\}_{i=1,\cdots,N}$ on the input image $I$.通过迭代$P^\prime$中的所有点,我们在输入图像$I$上生成网格$P=\{p_i\}_{i=1,\cdots,N}$。
- The process iterates until a leaf node is reached.该过程将迭代,直到到达叶节点。
- However, on very large lexicons, e.g. Hunspell [1], which contains more than 50k words, computing Eq. 10 is time-consuming, as it requires iterating over all lexicon words.但是,在非常大的词典上,例如包含超过5万个单词的Hunspell [1],计算公式10非常耗时,因为它需要遍历所有词典单词。
|
33 | arbitrary (3) | [ˈɑ:bɪtrəri] |
- The network extracts a sequence of feature vectors, given an input image of arbitrary size.给定任意大小的输入图像,网络提取特征向量序列。
- Both input and output sequences may have arbitrary lengths.输入和输出序列都可以具有任意长度。
- [32] is able to recognize arbitrary words, but it does not have a specific mechanism for handling curved text.[32]能够识别任意单词,但没有处理弯曲文本的特定机制。
|
34 | prefix (3) | [ˈpri:fɪks] |
- A prefix tree of three words: “ten”, “tea”, and “to”. $\epsilon$ and $\Omega$ are the tree root and the EOS token respectively.三个单词“ten”、“tea”和“to”的前缀树。$\epsilon$和$\Omega$分别是树根和EOS令牌。
- The motivation is that computation can be shared among words that share the same prefix.动机是计算可以在共享相同前缀的单词之间共享。
- We first construct a prefix tree over a given lexicon.我们首先在给定的词典上构建一棵前缀树。
|
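The prefix-tree search these sentences describe can be sketched as a greedy descent. The tree and edge probabilities below are made up for illustration; a real model would take the probabilities from the decoder's output distribution $\hat y_t$ at each step:

```python
# Toy prefix tree over {"ten", "tea", "to"}; leaves end with the EOS marker "Ω".
# Each node maps a character to (probability, subtree); a greedy decoder walks
# down by always taking the most probable child, sharing work across prefixes.
TREE = {
    "t": (1.0, {
        "e": (0.7, {
            "n": (0.6, {"Ω": (1.0, {})}),
            "a": (0.4, {"Ω": (1.0, {})}),
        }),
        "o": (0.3, {"Ω": (1.0, {})}),
    }),
}

def greedy_decode(tree):
    word = ""
    node = tree
    while node:
        ch, (p, sub) = max(node.items(), key=lambda kv: kv[1][0])
        if ch == "Ω":            # EOS reached: stop at a leaf
            break
        word += ch
        node = sub
    return word

print(greedy_decode(TREE))  # → "ten"
```

Since the walk is bounded by the depth of the tree (the longest lexicon word), this costs far less than scoring every lexicon word separately.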
35 | placement (2) | [ˈpleɪsmənt] |
- Different from those in documents, words in natural images often possess irregular shapes, which are caused by perspective distortion, curved character placement, etc. We propose RARE (Robust text recognizer with Automatic REctification), a recognition model that is robust to irregular text.与文档中的文字不同,自然图像中的文字通常具有不规则形状,这是由透视扭曲、弯曲的字符排列等引起的。我们提出了RARE(具有自动矫正功能的鲁棒文本识别器),一种对不规则文本具有鲁棒性的识别模型。
- Due to its irregular character placement, recognizing curved text is very challenging.由于其不规则的字符放置,识别弯曲文本是非常具有挑战性的。
|
36 | side-view (2) | ['saɪdvj'u:] |
- For example, some scene text is perspective text [29], which is caused by side-view camera angles; some has curved shapes, meaning that its characters are placed along curves rather than straight lines.例如,一些场景文本是透视文本[29],它是由侧视摄像机角度引起的;有些具有弯曲的形状,这意味着它的角色沿着曲线而不是直线放置。
- In fig. 4, we show some common types of irregular text, including a) loosely-bounded text, which results from imperfect text detection; b) multi-oriented text, caused by non-horizontal camera views; c) perspective text, caused by side-view camera angles; d) curved text, a commonly seen artistic style.在图4中,我们展示了一些常见类型的不规则文本,包括a)松散有界的文本,由不完美的文本检测引起;b)多方向文本,由非水平相机视角引起;c)透视文本,由侧视相机角度引起;d)弯曲文本,一种常见的艺术风格。
|
37 | back-propagated (2) | [!≈ bæk ˈprɔpəɡeitid] |
- The dashed lines represent the flows of the back-propagated gradients.虚线表示反向传播的梯度的流动。
- Consequently, for the STN, we do not need to label any geometric ground truth, i.e. the positions of the TPS fiducial points, but let its training be supervised by the error differentials back-propagated by the SRN.因此,对于STN,我们不需要标注任何几何真值,即TPS基准点的位置,而是让其训练由SRN反向传播的误差微分来监督。
|
38 | ideally (2) | [aɪ'di:əlɪ] |
- Ideally, the STN produces an image that contains regular text, which is a more appropriate input for the SRN than the original one.在理想情况下,STN产生的图像是一类常规的文本图像,这比原来的不规则的文本图像更合适输入到SRN中。
- The input to the SRN is a rectified image $I^\prime$ , which ideally contains a word that is written horizontally from left to right.SRN的输入是一个矫正的图像$I^\prime$,理想情况下包含一个从左到右水平写入的单词。
|
39 | regress (2) | [rɪˈgres] |
- The TPS transformation is configured by a set of fiducial points, whose coordinates are regressed by a convolutional neural network.TPS变换是由一组基准点决定,这些基准点的坐标就是由STN这个卷积神经网络回归出来的。
- The localization network localizes K fiducial points by directly regressing their $x,y$ -coordinates.定位网络通过直接回归K个基准点的$x,y$坐标来定位它们。
|
40 | decode (2) | [ˌdi:ˈkəʊd] |
- The decoder recurrently generates a character sequence conditioning on the input sequence, by decoding the relevant contents which are determined by its attention mechanism at each step.解码器通过在每一步解码由其注意力机制确定的相关内容,以输入序列为条件循环地生成字符序列。
- The decoder recurrently generates a sequence conditioned on the sequential representation, by decoding the relevant contents it attends to at each step.解码器通过解码在每个步骤中所关注的相关内容,循环地生成以顺序表示为条件的序列。
|
41 | geometric (2) | [ˌdʒi:əˈmetrɪk] |
- Consequently, for the STN, we do not need to label any geometric ground truth, i.e. the positions of the TPS fiducial points, but let its training be supervised by the error differentials back-propagated by the SRN.因此,对于STN,我们不需要标注任何几何真值,即TPS基准点的位置,而是让其训练由SRN反向传播的误差微分来监督。
- The extensive experimental results show that 1) without geometric supervision, the learned model can automatically generate more “readable” images for both human and the sequence recognition network; 2) the proposed text rectification method can significantly improve recognition accuracies on irregular scene text; 3) the proposed scene text recognition system is competitive compared with the state-of-the-arts.大量的实验结果表明,1)在没有几何监督的情况下,学习模型可以自动为人类和序列识别网络生成更“可读”的图像;2)提出的文本校正方法可以显著提高不规则场景文本的识别准确率;3)与现有技术相比,提出的场景文本识别系统具有竞争力。
|
42 | differential (2) | [ˌdɪfəˈrenʃl] |
- Consequently, for the STN, we do not need to label any geometric ground truth, i.e. the positions of the TPS fiducial points, but let its training be supervised by the error differentials back-propagated by the SRN.因此,对于STN,我们不需要标注任何几何真值,即TPS基准点的位置,而是让其训练由SRN反向传播的误差微分来监督。
- Therefore, once we have a differentiable localization network and a differentiable grid generator, the STN can back-propagate error differentials and gets trained.因此,一旦我们拥有可微分的定位网络和可微分的网格生成器,STN就可以反向传播误差微分并得到训练。
|
43 | convolutional-recurrent (2) | [!≈ kɒnvə'lu:ʃənəl rɪˈkʌrənt] |
- Third, our model adopts a convolutional-recurrent structure in the encoder of the SRN, and is thus a novel variant of the attention-based model [4].第三,我们的模型在SRN的编码器中采用卷积-循环结构,因此是基于注意力的模型[4]的一种新颖变体。
- 3.2.1 Encoder: Convolutional-Recurrent Network3.2.1编码器:卷积 - 循环网络
|
44 | Jaderberg (2) | |
- Jaderberg et al. [17] address text recognition with a 90k-class convolutional neural network, where each class corresponds to an English word.Jaderberg等人[17]使用90k级卷积神经网络进行文本识别,其中每个类对应一个英语单词。
- Our model is trained on the 8-million synthetic samples released by Jaderberg et al. [15].我们的模型在Jaderberg等人[15]发布的800万个合成样本上进行训练。
|
45 | unconstrained (2) | [ˌʌnkən'streɪnd] |
- In [16], a CNN with a structured output layer is constructed for unconstrained text recognition.在[16]中,构造具有结构化输出层的CNN用于无约束文本识别。
- On unconstrained recognition tasks (recognizing without a lexicon), our model outperforms all the other methods in comparison.在无约束的识别任务(没有词典识别)的情况下,我们的模型在比较中优于所有其他方法。
|
46 | back-propagate (2) | [!≈ bæk ˈprɒpəgeɪt] |
- Therefore, once we have a differentiable localization network and a differentiable grid generator, the STN can back-propagate error differentials and gets trained.因此,一旦我们拥有可微分的定位网络和可微分的网格生成器,STN就可以反向传播误差微分并得到训练。
- The grid generator can back-propagate gradients, since its two matrix multiplications, Eq. 1 and Eq. 4, are both differentiable.网格生成器可以反向传播梯度,因为它的两个矩阵乘法(公式1和公式4)都是可微分的。
|
47 | normalize (2) | [ˈnɔ:məlaɪz] |
- We use a normalized coordinate system whose origin is the image center, so that $x_k, y_k$ are within the range of $[-1, 1]$ .我们使用归一化坐标系,其原点是图像中心,因此$x_k, y_k$在$[-1, 1]$的范围内。
- Since K is a constant and the coordinate system is normalized, $C^\prime$ is always a constant.由于K是常数并且坐标系被归一化,因此$C^\prime$始终是常数。
|
48 | euclidean (2) | [ju:ˈklidiən] |
- where the element on the i-th row and j-th column of R is $r_{i,j}=d_{i,j}^2 \ln d_{i,j}^2$ , $d_{i,j}$ is the euclidean distance between $c_i^\prime$ and $c_j^\prime$ .其中R的第i行第j列的元素是$r_{i,j}=d_{i,j}^2 \ln d_{i,j}^2$,$d_{i,j}$是$c_i^\prime$和$c_j^\prime$之间的欧氏距离。
- where $d_{i,k}$ is the euclidean distance between $p^\prime_i$ and the $k$-th base fiducial point $c^\prime_k$ .其中$d_{i,k}$是$p^\prime_i$与第$k$个基本基准点$c^\prime_k$之间的欧氏距离。
|
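For reference, the standard thin-plate-spline radial basis from the general TPS literature (stated here as background, not copied from this paper) applies a logarithmic kernel to each pairwise euclidean distance:

```latex
U(r) = r^{2}\ln r^{2},
\qquad
r_{i,j} = U(d_{i,j}) = d_{i,j}^{2}\ln d_{i,j}^{2}
```

Here $d_{i,j}$ is the euclidean distance between base fiducial points $c_i^\prime$ and $c_j^\prime$; stacking the $r_{i,j}$ yields the matrix R used in the TPS transformation.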
49 | loosely-bounded (2) | [!≈ ˈlu:sli 'baʊndɪd] |
- In fig. 4, we show some common types of irregular text, including a) loosely-bounded text, which results from imperfect text detection; b) multi-oriented text, caused by non-horizontal camera views; c) perspective text, caused by side-view camera angles; d) curved text, a commonly seen artistic style.在图4中,我们展示了一些常见类型的不规则文本,包括a)松散有界的文本,由不完美的文本检测引起;b)多方向文本,由非水平相机视角引起;c)透视文本,由侧视相机角度引起;d)弯曲文本,一种常见的艺术风格。
- The STN can deal with several types of irregular text, including (a) loosely-bounded text; (b) multi-oriented text; (c) perspective text; (d) curved text.STN可以处理几种类型的不规则文本,包括(a)松散有界的文本; (b)多方面文本; (c)透视文本; (d)弯曲文本。
|
50 | receptive (2) | [rɪˈseptɪv] |
- According to the translation invariance property of CNN, each vector corresponds to a local image region, i.e. receptive field, and is a descriptor for that region.根据CNN的平移不变性,每个向量对应一个局部图像区域,即感受野,并且是该区域的描述符。
- Restricted by the sizes of the receptive fields, the feature sequence leverages limited image contexts.受感受野大小的限制,特征序列只能利用有限的图像上下文。
|
51 | log-likelihood (2) | [!≈ lɒg ˈlaɪklihʊd] |
- To train the model, we minimize the negative log-likelihood over X :为了训练模型,我们最小化X上的负对数似然:
- After each step, the list is updated to store the nodes with top-B accumulated log-likelihoods, where B is the beam width.在每个步骤之后,更新列表以存储具有前B累积对数似然的节点,其中B是波束宽度。
|
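The top-B beam update described here can be sketched as follows. The per-step character distributions are invented for illustration, and `beam_width` plays the role of B:

```python
import math

# Toy beam search: at each step, expand every hypothesis with every character
# probability, then keep only the top-B hypotheses by accumulated log-likelihood.
STEPS = [
    {"t": 0.9, "f": 0.1},
    {"e": 0.6, "o": 0.4},
    {"n": 0.6, "a": 0.4},
]

def beam_search(steps, beam_width=2):
    beams = [("", 0.0)]  # (prefix, accumulated log-likelihood)
    for dist in steps:
        expanded = [(prefix + ch, score + math.log(p))
                    for prefix, score in beams
                    for ch, p in dist.items()]
        expanded.sort(key=lambda b: b[1], reverse=True)
        beams = expanded[:beam_width]   # keep the top-B hypotheses
    return beams[0][0]

print(beam_search(STEPS))  # → "ten"
```

Summing log-probabilities instead of multiplying raw probabilities avoids numerical underflow on long sequences, which is why accumulated log-likelihoods are the standard beam score.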
52 | convergence (2) | [kən'vɜ:dʒəns] |
- The optimization algorithm is the ADADELTA [41], which we find fast in convergence speed.优化算法是ADADELTA [41],我们发现其收敛速度很快。
- Randomly initializing the localization network results in failure of convergence during training.随机初始化定位网络导致训练期间收敛失败。
|
53 | Hunspell (2) | |
- However, on very large lexicons, e.g. Hunspell [1], which contains more than 50k words, computing Eq. 10 is time-consuming, as it requires iterating over all lexicon words.但是,在非常大的词典上,例如包含超过5万个单词的Hunspell [1],计算公式10非常耗时,因为它需要遍历所有词典单词。
- Besides, there is a “full lexicon” which contains all lexicon words, and the Hunspell [1] lexicon which has 50k words.此外,还有一个包含所有词典单词的“完整词典”和包含50k单词的Hunspell [1]词典。
|
54 | synthetic (2) | [sɪnˈθetɪk] |
- Our model is trained on the 8-million synthetic samples released by Jaderberg et al. [15].我们的模型在Jaderberg等人[15]发布的800万个合成样本上进行训练。
- We use the same model trained on the synthetic dataset without fine-tuning.我们使用在合成数据集上训练的相同模型而不进行微调。
|
55 | ICDAR (2) | [!≈ aɪ si: di: eɪ ɑ:(r)] |
- • ICDAR 2003 [24] (IC03) contains 860 cropped word images, each associated with a 50-word lexicon defined by Wang et al. [35].•ICDAR 2003 [24](IC03)包含860个裁剪的单词图像,每个图像与Wang等人[35]定义的50个单词的词典相关联。
- • ICDAR 2013 [20] (IC13) inherits most of its samples from IC03.•ICDAR 2013 [20](IC13)继承了IC03的大部分样本。
|
56 | alleviate (2) | [əˈli:vieɪt] |
- Our rectification scheme can significantly alleviate this problem.我们的整改计划可以显着缓解这一问题。
- Generally, the rectification made by the STN is not perfect, but it alleviates the recognition difficulty to some extent.一般来说,STN所做的纠正并不完美,但在一定程度上缓解了识别的困难。
|
57 | unsolved (1) | [ˌʌnˈsɒlvd] |
- Recognizing text in natural images is a challenging task with many unsolved problems.识别自然图像中的文本是一项具有挑战性的任务,存在许多未解决的问题。
|
58 | Thin-Plate-Spline (1) | [!≈ θɪn pleɪt splaɪn] |
- In testing, an image is firstly rectified via a predicted Thin-Plate-Spline (TPS) transformation, into a more “readable” image for the following SRN, which recognizes text through a sequence recognition approach.在测试中,图像首先通过预测的薄板样条(TPS)变换被矫正为对后续SRN更“可读”的图像,SRN再通过序列识别方法识别文本。
|
59 | trainable (1) | [t'reɪnəbl] |
- RARE is end-to-end trainable, requiring only images and associated text labels, making it convenient to train and deploy the model in practical systems.RARE是端到端的可训练的,只需要图像和相关的文本标签,便于在实际系统中训练和部署模型。
|
60 | billboard (1) | [ˈbɪlbɔ:d] |
- In natural scenes, text appears on various kinds of objects, e.g. road signs, billboards, and product packaging.在自然场景中,文本出现在各种对象上,例如道路标志,广告牌和产品包装。
|
61 | semantic (1) | [sɪˈmæntɪk] |
- It carries rich and high-level semantic information that is important for image understanding.它携带丰富的高级语义信息,这对于图像理解非常重要。
|
62 | real-world (1) | [!≈ ˈri:əl wɜ:ld] |
- Recognizing text in images facilitates many real-world applications, such as geolocation, driverless car, and image-based machine translation.识别图像中的文本有助于许多实际应用,例如地理定位,无人驾驶汽车和基于图像的机器翻译。
|
63 | geolocation (1) | [dʒɪɒləʊ'keɪʃn] |
- Recognizing text in images facilitates many real-world applications, such as geolocation, driverless car, and image-based machine translation.识别图像中的文本有助于许多实际应用,例如地理定位,无人驾驶汽车和基于图像的机器翻译。
|
64 | driverless (1) | [d'raɪvərles] |
- Recognizing text in images facilitates many real-world applications, such as geolocation, driverless car, and image-based machine translation.识别图像中的文本有助于许多实际应用,例如地理定位,无人驾驶汽车和基于图像的机器翻译。
|
65 | frontal (1) | [ˈfrʌntl] |
- We call such text irregular text, in contrast to regular text which is horizontal and frontal.我们将此类文本称为不规则文本,与常规文本(水平和正面)形成对比。
|
66 | Schematic (1) | [ski:ˈmætɪk] |
- Figure 1. Schematic overview of RARE, which consists of a spatial transformer network (STN) and a sequence recognition network (SRN).图1. RARE的示意图,由空间变换器网络(STN)和序列识别网络(SRN)组成。
|
67 | jointly (1) | [dʒɔɪntlɪ] |
- The two networks are jointly trained by the back-propagation algorithm [22].这两个网络由反向传播算法共同训练[22]。
|
68 | dash (1) | [dæʃ] |
- The dashed lines represent the flows of the back-propagated gradients.虚线表示反向传播的梯度的流动。
|
69 | flows (1) | [fləʊz] |
- The dashed lines represent the flows of the back-propagated gradients.虚线表示反向传播的梯度的流动。
|
70 | tightly-bounded (1) | [!≈ ˈtaɪtli 'baʊndɪd] |
- Usually, a text recognizer works best when its input images contain tightly-bounded regular text.通常,文本识别器在其输入图像包含紧密有界的常规文本时效果最佳。
|
71 | spatially (1) | ['speɪʃəlɪ] |
- In the STN, an input image is spatially transformed into a rectified image.在STN中,输入图像在空间上变换成校正后的图像。
|
72 | thin-plate-spline (1) | [!≈ θɪn pleɪt splaɪn] |
- The transformation is a thin-plate-spline [6] (TPS) transformation, whose nonlinearity allows us to rectify various types of irregular text, including perspective and curved text.该变换是薄板样条[6](TPS)变换,其非线性使我们能够矫正各种类型的不规则文本,包括透视文本和弯曲文本。
|
73 | nonlinearity (1) | [nɒnlɪnɪ'ærɪtɪ] |
- The transformation is a thin-plate-spline [6] (TPS) transformation, whose nonlinearity allows us to rectify various types of irregular text, including perspective and curved text.该变换是薄板样条[6](TPS)变换,其非线性使我们能够矫正各种类型的不规则文本,包括透视文本和弯曲文本。
|
74 | configure (1) | [kənˈfɪgə(r)] |
- The TPS transformation is configured by a set of fiducial points, whose coordinates are regressed by a convolutional neural network.TPS变换是由一组基准点决定,这些基准点的坐标就是由STN这个卷积神经网络回归出来的。
|
75 | variant (1) | [ˈveəriənt] |
- Third, our model adopts a convolutional-recurrent structure in the encoder of the SRN, thus is a novel variant of the attention-based model [4].第三,在SRN的编码器中,我们采用卷积循环结构,这是注意力模型的一种新颖的变体。
|
76 | Hough (1) | [hʌf] |
- Among the traditional methods, many adopt bottom-up approaches, where individual characters are firstly detected using sliding window [36, 35], connected components [28], or Hough voting [39].在传统方法中,许多方法采用自下而上的方法,其中首先使用滑动窗口[36,35],连通组件[28]或霍夫投票[39]来检测单个字符。
|
77 | Almázan (1) | |
- For example, Almázan et al. [2] propose to predict label embedding vectors from input images.例如,Almázan等人[2]建议从输入图像预测标签嵌入向量。
|
78 | k-class (1) | [!≈ keɪ klɑ:s] |
- Jaderberg et al. [17] address text recognition with a 90k-class convolutional neural network, where each class corresponds to an English word.Jaderberg等人[17]使用90k级卷积神经网络进行文本识别,其中每个类对应一个英语单词。
|
79 | HOG (1) | [hɒg] |
- Su and Lu [34] extract sequential image representation, which is a sequence of HOG [10] descriptors, and predict the corresponding character sequence with a recurrent neural network (RNN).Su和Lu [34]提取序列图像表示,它是HOG [10]描述符的序列,并用递归神经网络(RNN)预测相应的字符序列。
|
80 | rotation-invariant (1) | [!≈ rəʊˈteɪʃn ɪnˈveəriənt] |
- Yao et al. [38] firstly propose the multi-oriented text detection problem, and deal with it by carefully designing rotation-invariant region descriptors.姚等人[38]首先提出了多方向文本检测问题,并通过仔细设计旋转不变区域描述符来处理它。
|
81 | Phan (1) | [fæn] |
- Phan et al. propose to explicitly rectify perspective distortions via SIFT [23] descriptor matching.潘等人建议通过SIFT [23]描述符匹配明确纠正透视失真。
|
82 | SIFT (1) | [sɪft] |
- Phan et al. propose to explicitly rectify perspective distortions via SIFT [23] descriptor matching.潘等人建议通过SIFT [23]描述符匹配明确纠正透视失真。
|
83 | above-mentioned (1) | [ə'bʌv 'menʃnd] |
- The above-mentioned work brings insightful ideas into this issue.上述工作为这个问题带来了深刻的见解。
|
84 | insightful (1) | [ˈɪnsaɪtfʊl] |
- The above-mentioned work brings insightful ideas into this issue.上述工作为这个问题带来了深刻的见解。
|
85 | annotation (1) | [ˌænə'teɪʃn] |
- Moreover, it does not require extra annotations for the rectification process, since the STN is supervised by the SRN during training.此外,它不需要额外的注释用于整理过程,因为STN在训练期间由SRN监督。
|
86 | propagate (1) | [ˈprɒpəgeɪt] |
- Instead, the training of the localization network is completely supervised by the gradients propagated by the other parts of the STN, following the back-propagation algorithm [22].相反,定位网络的训练完全受到STN其他部分传播的梯度的监督,遵循反向传播算法[22]。
|
87 | Cyan (1) | [ˈsaɪən] |
- Cyan markers on the right image are the base fiducial points $C^\prime$.右侧图像上的青色标记是基本基准点$C^\prime$。
|
88 | multiplication (1) | [ˌmʌltɪplɪˈkeɪʃn] |
- The grid generator can back-propagate gradients, since its two matrix multiplications, Eq. 1 and Eq. 4, are both differentiable.网格生成器可以反向传播梯度,因为它的两个矩阵乘法(公式1和公式4)都是可微分的。
|
89 | Lastly (1) | [ˈlɑ:stli] |
- Lastly, in the sampler, the pixel value of $p^\prime_i$ is bilinearly interpolated from the pixels near $p_i$ on the input image.最后,在采样器中,$p^\prime_i$的像素值是从输入图像上的$p_i$附近的像素进行双线性插值。
|
90 | bilinearly (1) | [!≈ baɪ'lɪnɪəli] |
- Lastly, in the sampler, the pixel value of $p^\prime_i$ is bilinearly interpolated from the pixels near $p_i$ on the input image.最后,在采样器中,$p^\prime_i$的像素值是从输入图像上的$p_i$附近的像素进行双线性插值。
|
91 | interpolate (1) | [ɪnˈtɜ:pəleɪt] |
- Lastly, in the sampler, the pixel value of $p^\prime_i$ is bilinearly interpolated from the pixels near $p_i$ on the input image.最后,在采样器中,$p^\prime_i$的像素值是从输入图像上的$p_i$附近的像素进行双线性插值。
|
92 | bilinear (1) | [baɪ'lɪnɪə] |
- where $V$ represents the bilinear sampler [18], which is also a differentiable module.其中V代表双线性采样器[18],它也是一个可微分的模块。
|
93 | flexibility (1) | [ˌfleksə'bɪlətɪ] |
- The flexibility of the TPS transformation allows us to transform irregular text images into rectified images that contain regular text.TPS转换的灵活性允许我们将不规则文本图像转换为包含常规文本的矫正图像。
|
94 | imperfect (1) | [ɪmˈpɜ:fɪkt] |
- In fig. 4, we show some common types of irregular text, including a) loosely-bounded text, which results from imperfect text detection; b) multi-oriented text, caused by non-horizontal camera views; c) perspective text, caused by side-view camera angles; d) curved text, a commonly seen artistic style.在图4中,我们展示了一些常见类型的不规则文本,包括a)松散有界的文本,由不完美的文本检测引起;b)多方向文本,由非水平相机视角引起;c)透视文本,由侧视相机角度引起;d)弯曲文本,一种常见的艺术风格。
|
95 | inherently (1) | [ɪnˈhɪərəntlɪ] |
- Since target words are inherently sequences of characters, we model the recognition problem as a sequence recognition problem, and address it with a sequence recognition network.由于目标词本质上是字符序列,我们将识别问题建模为序列识别问题,并用序列识别网络对其进行处理。
|
96 | horizontally (1) | [ˌhɒrɪ'zɒntəlɪ] |
- The input to the SRN is a rectified image $I^\prime$ , which ideally contains a word that is written horizontally from left to right.SRN的输入是一个矫正的图像$I^\prime$,理想情况下包含一个从左到右水平写入的单词。
|
97 | naïve (1) | [nɑ:ˈi:v] |
- A naïve approach for extracting a sequential representation for $I^\prime$ is to take local image patches from left to right, and describe each of them with a CNN.用于提取$I^\prime$的顺序表示的一种朴素方法是从左到右取局部图像块,并用CNN描述每个图像块。
|
98 | flattens (1) | ['flætnz] |
- Specifically, the “map-to-sequence” operation takes out the columns of the maps in the left-to-right order, and flattens them into vectors.具体而言,“map-to-sequence”操作按从左到右的顺序取出特征图的列,并将它们展平为向量。
|
99 | invariance (1) | [ɪn'veərɪəns] |
- According to the translation invariance property of CNN, each vector corresponds to a local image region, i.e. receptive field, and is a descriptor for that region.根据CNN的平移不变性,每个向量对应一个局部图像区域,即感受野,并且是该区域的描述符。
|
100 | ConvNet (1) | |
- Structure of the SRN, which consists of an encoder and a decoder. The encoder uses several convolution layers (ConvNet) and a two-layer BLSTM network to extract a sequential representation (h) for the input image.编码器使用几个卷积层(ConvNet)和两层BLSTM网络来提取输入图像的顺序表示(h)。
|
101 | Long-Short (1) | [!≈ lɒŋ ʃɔ:t] |
- We further apply a two-layer Bidirectional Long-Short Term Memory (BLSTM) [14, 13] network to the sequence, in order to model the long-term dependencies within the sequence.我们进一步将两层双向长短期记忆(BLSTM)[14,13]网络应用于序列,以模拟序列内的长期依赖性。
|
102 | analyze (1) | ['ænəlaɪz] |
- The BLSTM is a recurrent network that can analyze the dependencies within a sequence in both directions; it outputs another sequence which has the same length as the input one.BLSTM是一个循环网络,可以在两个方向上分析序列内的依赖关系;它输出另一个与输入序列等长的序列。
|
103 | recurrency (1) | |
- In the recurrency part, we adopt the Gated Recurrent Unit (GRU) [7] as the cell.在循环部分,我们采用门控循环单元(GRU)[7]作为单元。
|
104 | linearly (1) | [ˈliniəli] |
- Then, a glimpse $g_t$ is computed by linearly combining the vectors in $h$: $g_t=\sum_{i=1}^L\alpha_{ti}h_i$.然后,通过线性组合$h$中的向量来计算一瞥$g_t$:$g_t=\sum_{i=1}^L\alpha_{ti}h_i$。
|
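The glimpse in the sentence above is a plain attention-weighted sum over the encoder outputs. A minimal NumPy sketch (the weights $\alpha_t$ and sequence $h$ below are toy values, not taken from the paper):

```python
import numpy as np

def glimpse(alpha, h):
    """Compute g_t = sum_i alpha_ti * h_i.

    alpha: (L,) attention weights for step t (assumed normalized)
    h:     (L, D) encoder output sequence
    returns: (D,) glimpse vector
    """
    return (alpha[:, None] * h).sum(axis=0)

# toy example: L = 3 sequence steps, D = 2 features
h = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
alpha = np.array([0.5, 0.25, 0.25])
g = glimpse(alpha, h)
print(g)  # [0.75 0.5]
```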
105 | alphanumeric (1) | [ˌælfənju:ˈmerɪk] |
- The label space includes all English alphanumeric characters, plus a special “end-of-sequence” (EOS) token, which ends the generation process.标签空间包括所有英文字母数字字符,以及一个特殊的“序列结束”(EOS)令牌,它结束生成过程。
|
106 | end-of-sequence (1) | |
- The label space includes all English alphanumeric characters, plus a special “end-of-sequence” (EOS) token, which ends the generation process.标签空间包括所有英文字母数字字符,以及一个特殊的“序列结束”(EOS)令牌,它结束生成过程。
|
107 | ADADELTA (1) | [!≈ eɪ di: eɪ di: i: el ti: eɪ] |
- The optimization algorithm is the ADADELTA [41], which we find fast in convergence speed.优化算法是ADADELTA [41],我们发现其收敛速度很快。
|
108 | Empirically (1) | [ɪm'pɪrɪklɪ] |
- Empirically, we also find that the patterns displayed in Fig. 6(b) and Fig. 6(c) yield relatively poorer performance.根据经验,我们还发现图6(b)和图6(c)所示的模式产生相对较差的性能。
|
109 | conditional (1) | [kənˈdɪʃənl] |
- When a test image is associated with a lexicon, i.e. a set of words for selection, the recognition process is to pick the word with the highest posterior conditional probability:当测试图像与词典相关联时,即一组用于选择的单词时,识别过程是选择具有最高后验条件概率的单词:
|
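Recognizing with a lexicon, as in the sentence above, amounts to an argmax over the candidate words. A brute-force sketch, where `log_prob` is a hypothetical scorer standing in for the model's posterior conditional probability (not the paper's actual model):

```python
def recognize_with_lexicon(log_prob, lexicon):
    """Pick the lexicon word with the highest posterior score.

    log_prob: callable mapping a word to its model log-likelihood
              (hypothetical; stands in for p(word | image))
    lexicon:  iterable of candidate words
    """
    return max(lexicon, key=log_prob)

# toy scorer: prefer words whose length is closest to 5 characters
def score(word):
    return -abs(len(word) - 5)

best = recognize_with_lexicon(score, ['cafe', 'coffee', 'cocoa'])
print(best)  # 'cocoa'
```

Iterating over every word like this is what makes very large lexicons (e.g. 50k words) slow, motivating the prefix-tree approximation mentioned in the paper.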
110 | incorporate (1) | [ɪnˈkɔ:pəreɪt] |
- Recognition performance could be further improved by incorporating beam search.通过结合波束搜索可以进一步提高识别性能。
|
111 | top-B (1) | [!≈ tɒp bi:] |
- After each step, the list is updated to store the nodes with top-B accumulated log-likelihoods, where B is the beam width.在每个步骤之后,更新列表以存储具有前B累积对数似然的节点,其中B是波束宽度。
|
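The top-B pruning described above can be sketched as a generic beam search (toy per-step symbol distributions; this is a simplified illustration, not the paper's exact decoder):

```python
import math

def beam_search(step_log_probs, beam_width):
    """Keep the top-B partial sequences by accumulated log-likelihood.

    step_log_probs: list of dicts, one per decoding step, mapping
                    symbol -> log probability (fixed per step here
                    for simplicity; a real decoder conditions on the
                    sequence so far)
    beam_width:     B, the number of beams kept after each step
    """
    beams = [((), 0.0)]  # (sequence, accumulated log-likelihood)
    for dist in step_log_probs:
        candidates = [(seq + (sym,), score + lp)
                      for seq, score in beams
                      for sym, lp in dist.items()]
        # update the list to the nodes with top-B accumulated scores
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)
        beams = beams[:beam_width]
    return beams

steps = [{'a': math.log(0.6), 'b': math.log(0.4)},
         {'a': math.log(0.3), 'b': math.log(0.7)}]
best_seq, best_score = beam_search(steps, beam_width=2)[0]
print(best_seq)  # ('a', 'b'), accumulated log(0.6 * 0.7)
```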
112 | resize (1) | [ˌri:ˈsaɪz] |
- Following [17, 16], images are resized to $100 \times 32$ in both training and testing.在[17,16]之后,在训练和测试中将图像调整为$100 \times 32$。
|
113 | epoch (1) | [ˈi:pɒk] |
- Our model processes ~160 samples per second during training, and converges in 2 days after ~3 epochs over the training dataset.我们的模型在训练期间每秒处理~160个样本,并且在训练数据集的~3个时期之后的2天内收敛。
|
114 | GPU-accelerated (1) | |
- Most parts of the model are GPU-accelerated.该模型的大多数部分都是GPU加速的。
|
115 | Xeon (1) | |
- All our experiments are carried out on a workstation which has one Intel Xeon(R) E5-2620 2.40GHz CPU, an NVIDIA GTX-Titan GPU, and 64GB RAM.我们所有的实验都是在一个工作站上进行的,该工作站有一个Intel Xeon(R)E5-2620 2.40GHz CPU,一个NVIDIA GTX-Titan GPU和64GB RAM。
|
116 | NVIDIA (1) | [ɪn'vɪdɪə] |
- All our experiments are carried out on a workstation which has one Intel Xeon(R) E5-2620 2.40GHz CPU, an NVIDIA GTX-Titan GPU, and 64GB RAM.我们所有的实验都是在一个工作站上进行的,该工作站有一个Intel Xeon(R)E5-2620 2.40GHz CPU,一个NVIDIA GTX-Titan GPU和64GB RAM。
|
117 | GTX-Titan (1) | |
- All our experiments are carried out on a workstation which has one Intel Xeon(R) E5-2620 2.40GHz CPU, an NVIDIA GTX-Titan GPU, and 64GB RAM.我们所有的实验都是在一个工作站上进行的,该工作站有一个Intel Xeon(R)E5-2620 2.40GHz CPU,一个NVIDIA GTX-Titan GPU和64GB RAM。
|
118 | RAM (1) | [ræm] |
- All our experiments are carried out on a workstation which has one Intel Xeon(R) E5-2620 2.40GHz CPU, an NVIDIA GTX-Titan GPU, and 64GB RAM.我们所有的实验都是在一个工作站上进行的,该工作站有一个Intel Xeon(R)E5-2620 2.40GHz CPU,一个NVIDIA GTX-Titan GPU和64GB RAM。
|
119 | k-word (1) | [!≈ keɪ wɜ:d] |
- With a 50k-word lexicon, the search takes ~200ms per image.使用50k字的词典,每张图像搜索大约需要200毫秒。
|
120 | IIIT (1) | [!≈ aɪ aɪ aɪ ti:] |
- • IIIT 5K-Words [25] (IIIT5K) contains 3000 cropped word images for testing.•IIIT 5K-Words [25](IIIT5K)包含3000个用于测试的裁剪单词图像。
|
121 | K-Words (1) | [!≈ keɪ wɜ:dz] |
- • IIIT 5K-Words [25] (IIIT5K) contains 3000 cropped word images for testing.•IIIT 5K-Words [25](IIIT5K)包含3000个用于测试的裁剪单词图像。
|
122 | non-alphanumeric (1) | [!≈ nɒn ˌælfənju:ˈmerɪk] |
- Following [35], we discard images that contain non-alphanumeric characters or have less than three characters.按照[35],我们丢弃包含非字母数字字符或少于三个字符的图像。
|
123 | IC13 (1) | |
- • ICDAR 2013 [20] (IC13) inherits most of its samples from IC03.•ICDAR 2013 [20](IC13)继承了IC03的大部分样本。
|
124 | k-dictionary (1) | [!≈ keɪ ˈdɪkʃənri] |
- [17] only recognizes words that are in its 90k-dictionary.[17]只识别其90k字典中的单词。
|
125 | par (1) | [pɑ:(r)] |
- On IIIT5K, SVT and IC03, constrained recognition accuracies are on par with [17], and slightly lower than [32].在IIIT5K,SVT和IC03上,约束识别精度与[17]相当,略低于[32]。
|
126 | SRN-only (1) | |
- From Tab. 1, we see that the SRN-only model is also a very competitive recognizer, achieving higher or competitive performance on most of the benchmarks.从表1中我们看到,仅SRN模型也是一个非常有竞争力的识别器,在大多数基准测试中实现了更高或相当的性能。
|
127 | validate (1) | [ˈvælɪdeɪt] |
- To validate the effectiveness of the rectification scheme, we evaluate RARE on the task of perspective text recognition.为了验证矫正方案的有效性,我们在透视文本识别任务上评估了RARE。
|
128 | deformed (1) | [dɪˈfɔ:md] |
- Text samples in SVT-Perspective are picked from side view angles in Google Street View, thus most of them are heavily deformed by perspective distortion.SVT-Perspective中的文本样本是从Google街景中的侧视角中选取的,因此大多数文本样本都因透视变形而严重变形。
|
129 | SVT-Perspective (1) | |
- Furthermore, recall the results in Tab. 1: on SVT-Perspective, RARE outperforms [32] by an even larger margin.此外,回顾表1中的结果,在SVT-Perspective上,RARE以更大的优势优于[32]。
|
130 | gray-scale (1) | [ɡ'reɪsk'eɪl] |
- The middle column is the rectified images (we use gray-scale images for recognition).中间一列是校正后的图像(我们使用灰度图像进行识别)。
|
131 | mistakenly (1) | [mɪ'steɪkənlɪ] |
- Green and red characters are correctly and mistakenly recognized characters, respectively.绿色和红色字符分别是正确和错误识别的字符。
|
132 | qualitative (1) | [ˈkwɒlɪtətɪv] |
- In fig. 9 we present some qualitative analysis.在图9中,我们提出了一些定性分析。
|
133 | artistic-style (1) | [!≈ ɑ:ˈtɪstɪk staɪl] |
- Curved text is a commonly seen artistic-style text in natural scenes.弯曲文本是自然场景中常见的艺术风格文本。
|
134 | advantageous (1) | [ˌædvənˈteɪdʒəs] |
- Therefore, it is advantageous on this task.因此,在这项任务上是有利的。
|
135 | acknowledgment (1) | [ək'nɒlɪdʒmənt] |
|
136 | NSFC (1) | [!≈ en es ef si:] |
- This work was primarily supported by National Natural Science Foundation of China (NSFC) (No. 61222308, No. 61573160 and No. 61503145), and Open Project Program of the State Key Laboratory of Digital Publishing Technology (No. F2016001).本工作主要得到国家自然科学基金(61222308,61573160,61503145)和数字出版技术国家重点实验室开放项目(No. F2016001)的支持。
|
137 | F2016001 (1) | |
- This work was primarily supported by National Natural Science Foundation of China (NSFC) (No. 61222308, No. 61573160 and No. 61503145), and Open Project Program of the State Key Laboratory of Digital Publishing Technology (No. F2016001).本工作主要得到国家自然科学基金(61222308,61573160,61503145)和数字出版技术国家重点实验室开放项目(No. F2016001)的支持。
|